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Abstract 

There is a great need for robust techniques in data mining and machine learning contexts where many 
standard techniques such as principal component analysis and linear discriminant analysis are inherently 
susceptible to outliers. Furthermore, standard robust procedures assume that less than half the observation 
rows of a data matrix are contaminated, which may not be a realistic assumption when the number of 
observed features is large. This work looks at the problem of estimating covariance and precision mahices 
under cellwise contamination. We consider using a robust pairwise covariance mahix as an input to various 
regularisation routines, such as the graphical lasso, QUIC and CLIME. To ensure the input covariance 
mahix is positive semidefinite, we use a method that transforms a symmetric matrix of pairwise covariances 
to the nearest covariance matrix. The result is a potentially sparse precision matrix that is resilient to 
moderate levels of cellwise contamination. Since this procedure is not based on subsampling it scales well 
as the number of variables increases. 
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1. Introduction 


Often the aim of data mining and statistics is to extract information about the relationships between 
the variables and identify any features or shucture in the data. The covariance matrix, H = var(y), where 
y ~ F, the distribution of the true data generating process, and its inverse, the precision matrix 0 = are 
fundamental components of many statistical procedures, such as principal component analysis (PCA) and 
linear discriminant analysis. However, it is well known that the classical covariance mahix is inherently 
non-robust to outliers and suffers from distortion in its eigenstructure in high dimensions ( Johnston^ 20011. 
This paper combines pairwise covariance mahix estimation with recent regularisation routines currently 
used in bioinformatics and machine learning to produce an estimated precision matrix that is robust to 
moderate levels of cellwise contamination. 


The need for robust statistics in data mining and associated fields is well known, see Barnett and Lewis 


(1994 1 for a general overview. In particular, it is desirable for learning algorithms to be stable with respect 
to noisy features and unusual fluctuations in the inputs. For example [Li| ( |2004| l considers robust incremental 
PCA applied to multi-view face modelling and Mavroeidis and Marchiori| ( |2014 1 consider the stability of 
sparse PCA in the context of feature selection in microarray gene expression data. Other situations where 
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robust techniques are important include speech recognition and neural networks, see Gales and van Dalen 


( |2007| | and |Bieroza et al.| ( |2011| ), respectively. 

In the statistics literature, robust estimation of covariance matrices has received much attention in the 
past, notably the minimum volume ellipsoid and minimum covariance determinant (MCD) estimators, pro¬ 
jection type estimators and M-estimators, see Hubert et al. (20081 for a survey. Furthermore, research into 
covariance matrix estimation and its applications is ongoing, see for example [Filzmoser et al. (2014i who 
use the MCD estimator to construct robust Mahalanobis distances to identify local multivariate outliers; 
Hubert et al. (2014[) who study the shape bias of a range of existing robust covariance matrix estimators; or 


Cator and Lopuhaa (2010[ 2012 1 who consider asymptotic expansions and establish asymptotic normality 


for general MCD estimators. 

An alternative approach is to estimate the covariance matrix in a component-wise manner based on a 
robust estimator of scale as outlined by Ma and Genton (2001). It is well known that the resulting sym¬ 
metric matrix is not guaranteed to be positive definite (PD). Methods to ensure the resulting estimator is 
PD have previously been explored by Rousseeuw and Molenberghs] ( 19931 with notable updates in the ro¬ 
bustness literature by Maronna and Zamar (|2002 1 and quite separately in the finance liferafure by Higham 


(20021. Alqallaf ef al. (20021 also proposed a pairwise approach fo covariance mafrix estimation by means 
of first Winsorising the data. The resulting Quadrant Covariance estimate does not necessarily require a 
transformation to ensure the result is positive definite. 

In practice, it is often the precision matrix, the inverse of the covariance matrix, that is primarily of 
interest. This is the case, for example, in Gaussian graphical model selection. As such, this paper is 
primarily concerned with robustly estimating the precision matrix. While there is an obvious link between 
covariance matrices and precision matrices, it is not obvious that a good (robust) estimator for one results 
in a good estimator for the other. We will employ robust pairwise covariance matrices as a starting point for 
various regularisation techniques to facilitate the estimation of robust, potentially sparse, precision matrices. 

Classical robust estimators assume that contamination occurs within a restricted subset of the observa¬ 
tion vectors, however, in recent years there has been interest in developing robust estimators that perform 
well under cellwise contamination. The cellwise contamination model was initially explored in a data min¬ 
ing context by [Alqallaf et al. (2002 1 and later defined comprehensively by Alqallaf ef al. (20091. This form 
of confaminafion is prevalenf in large, aufomafically generafed dafa sefs, found in dafa mining and bioinfor- 
mafics, where fhere is offen liffle qualify confrol over fhe inpufs. Cellwise contamination is common in the 
context of missing data, however, it represents a philosophical divergence from the traditional approach to 


robustness. Recent examples where the problem of cellwise contamination have arisen include, Farcomeni 
(2014l, |Van Aelst et al. (2012i and Agostinelli et al. (2014l. 

We perform a detailed simulation study to assess the performance of a variety of precision matrix esti¬ 
mators in the presence of cellwise contamination over a number of scenarios and levels of p while keeping 
the sample size fixed. Our resulfs are distilled from a comprehensive range of performance indices. We 
oufline fhese indices and consider fheir applicabilify fo fhe various scenarios in fhe supplemenfary maferial 
accompanying fhis arficle. 

We show fhaf fhe pairwise nafure of fhe covariance esfimafes enables fhe resulfing precision mafrix fo 
have a higher level of robusfness fhan when using sfandard robusf covariance mafrix esfimafion procedures 
in fhe presence of cellwise contamination. This is a novel result and a significant first step towards dealing 
with cellwise contamination in this context. 

The remainder of this paper is structured as follows. Section outlines the cellwise contamination 
model and highlights why standard robust techniques fail in this setting. Sections|^and|^outline the theory 
for existing pairwise covariance matrix estimation techniques and regularisation routines and we propose a 
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new procedure which combines robust pairwise covariance matrix estimation with regularisation. Sections 
l^andj^present the results of an extensive simulation study and Sectionj^summarises the important findings. 


2. Cellwise contamination 


Consider a data set X e consisting of n observations on p variables. Classically, even the most 
robust procedures are designed such that they only work when at most half of the rows in X have contami¬ 
nation present. 


Alqallaf et al. (20091 formally outline the cellwise contamination model as an extension of the standard 


Tukey-Huber contamination model which was first introduced in the univariate location-scale setup (Tukey 
1962t Huber 19641. Consider the data generating process for the n rows in X, x,- = (I - B,)y, -i- B,z;, 
where y, ~ F, the distribution of well-behaved data, z, ~ G, some outlier generating distribution and 
B, = diag(Bi,..., Bp) is a diagonal matrix, where B\,. . .,Bp are Bernoulli random variables, Bj ~ S{\,Sj). 
When y, B and z are independent we have a situation that is similar to the missing completely at random 
model, where the missingness does not depend on the values of y, see, for example, [Little and Rubin (20021. 

The structure of B, determines the contamination model. If Bi,..., Bp are fully dependent, then B, = 
Ui\, where Ui ~ S(l,e), and we recover the fully dependent contamination model, the standard model 
on which classical robust procedures are based. In this setting, the probability that an observation is un¬ 
contaminated, 1 - e, is independent of the dimensionality. Furthermore, the proportion of contaminated 
observations is preserved under affine equivariant transformations. 

In contrast, if Bi,..., Bp are mutually independent we have the fully independent contamination model, 
where each element of x, is drawn from F or G independently of the other p - I elements in x,. That 
is, contaminating observations occur independently at the univariate level. In this setting, it may be be 
unreasonable to assume that less than half the rows have contamination. Furthermore, if p is large and there 
is only one outlier in an observation vector, then down-weighting the entire observation may be wasteful. 

If the data matrix is randomly contaminated in this elementwise manner, as the number of variables 
increases, the chance that more than half the rows are contaminated increases exponentially. Formally, let 
£ be the probability that any particular element in a data matrix is contaminated. Assuming the contam¬ 
ination is randomly scattered throughout the data matrix, the probability that any particular row has no 
contamination is (1 - s)P, which quickly decays towards zero even for small values of e. For example, if 
p = 30 and e = 0.1, then the probability that any particular row remains uncontaminated is only 4%. This 
is demonstrated graphically in Figure [T] The plot on the left shows a 100 x 30 data matrix where 10% of 
the cells have been contaminated, the white cells. While virtually all the rows of the data matrix have at 
least one contaminated element, the majority of cells remain uncontaminated in the sense that they are still 
real measurements from the underlying data generating process. Even if £ = 0.03, the probability that any 
particular row is uncontaminated is 40%, however with a sample size of 100, this translates to a 98% chance 
that at least half the rows are contaminated, in which case standard robust methods fail. 

It is important to note that the fully independent contamination model lacks affine equivariance, in 
the sense that linear combinations of columns of a contaminated data set result in “outlier propagation” 


(Alqallaf et al. 20091. As such, affine equivariance is not an achievable outcome for any estimator in this 
setting. 

Existing research into the problem of cellwise contamination has focussed on coordinatewise proce¬ 
dures, that only operate on one column at a time. jCroux et al.j ([2003 j) consider an approach based on 


“alternating regressions” using weighted Li regression, Maronna and Yohm|(2008l use a coordinatewise 


procedure for principal component analysis. Eiu et al. (20031 have an application involving the singular 
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Figure 1: On the left, a heat map of a data matrix with 30 variables and 100 observations. 10% of the cells have been contaminated 
and are shown as white cells, while the uncontaminated cells are in various shades of grey. On the right, the probability that any 
particular row (observations) in the data matrix will be contaminated, 1 - (1 - e)'’, over a range of s, the proportion of cells affected 
by cellwise contamination. 


value decomposition of microarray data and De la Torre and Black (2001 1 consider cellwise contamination 
in the context of computer vision. 

We show that a pairwise approach is able to cope with much higher levels of cellwise contamination 
than existing classical robust estimators. In the simulations in Sectionj^we do not use the fully independent 
contamination model, rather, we impose restrictions on the amount of contamination in each variable. As 
such the contamination is no longer strictly independent, however, the advantage is that we are able to assess 
the impact over various known levels of contamination in each variable. 


3. Pairwise covariance matrix estimation 

A pairwise approach to estimating covariance matrices in the presence of cellwise contamination has 


previously been explored by Alqallaf et al. (20021 where the classical correlation coefficient was applied to 


a Winsorised data set. Instead of transforming the underlying data, our approach is to take the p{p - l)/2 
pairs of variables and robustly estimate the covariance between each pair. The primary advantage of this 
approach is robustness to cellwise contamination in the data set. The main disadvantage is that the resulting 
symmetric matrix is not guaranteed to be positive semidefinite or affine equivariant. However, as noted 
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earlier, in the cellwise contamination model, affine equivariance is unachievable as there is the potential for 
all rows to have a contaminated cell, hence linear combinations of the rows propagate the contamination. 


A simple method for turning scale estimators into covariance estimators was introduced by Gnanade 


sikan and Kettenring (1972) and brought to prominence in the context of robust estimation by Ma and 


Genton (20011. The idea is based on the identity. 


cov(A, Y) = ^ [var(crA + ySF) - \w{aX - pY )], 


( 1 ) 


where X and Y are random variables. In general, X and Y may have different scales, hence it is standard to 
let or = 1 / Vvar(A) and /? = 1 / Vvar(F). A robust covariance estimator is found by replacing the variance 
in ([^ with (squared) robust scale estimators. We will focus on the estimators Qn ([Rousseeuw and Croux 


1993| l, the T-scale as described in |Maronna and Zamar| ( |200^ and an estimator that is somewhat less robust 
but highly efficient, Pn, the interquartile range of the pairwise means, and its adaptively trimmed variant 


with adaptive trimming parameter d - h (Tarr et ah] 2012). We also consider the interquartile range (IQR) 


and the median absolute deviation from the median (MAD). A recent discussion on the efficiency of various 
robust scale estimators can be found in |Tarr et'ar] ( |2012] ). 

Using identity Q, a symmetric matrix full of pairwise covariances can easily be constructed, however, 
there is no guarantee that the result will be positive semidefinite. The two methods outlined below overcome 
this limitation by appropriately adjusting the eigenvalues of the symmetric matrix to ensure that they are all 
positive, and hence ensuring a positive definite result. 

3.1. Orthogonalised Gnanadesikan Kettenring procedure 


To overcome the possible lack of positive semidefiniteness in a matrix of pairwise covariances, Maronna 


and Zamar (2002) propose a modification based on the observation that the eigenvalues of a covariance 
matrix are the variances along the directions given by the respective eigenvectors. Essentially a principal 
components decomposition is performed and the covariance matrix is reconstructed using robust variance 
estimates of the principal component vectors in place of the original eigenvalues. This procedure is known 
as the Orthogonalised Gnanadesikan Kettenring (OGK) estimator. Note that even if the original covariance 
matrix was already positive definite, applying the OGK procedure will not necessarily return the same 
matrix. 

Maronna and Zam^(2002i and Maronna et al. (2006 p. 207) suggest that the OGK estimator can be 


improved by iterating the procedure and then using this estimate to find robust Mahalanobis distances for 
each observation vector. These are then used to screen for outliers before applying the classical covariance 
estimator to the cleaned data, resulting in a procedure known as the reweighted OGK. This is done in 
an effort to increase efficiency and to make the result “more equivariant”. In terms of the impact of not 


being affine equivariant, Maronna and Zamar (20021 note that “although the worst case may differ from fhe 


original data, for most transformations the results are very similar” and “the lack of equivariance is not a 
serious concern in our estimates”. 

Regardless, neither the OGK method nor the reweighted OGK method is able to cope with cellwise 
contamination. The issue of outlier propagation means that the number of contaminated principal compo¬ 
nents could easily be greater than 50% even for small levels of cellwise contamination. Hence, the robust 
variance estimates that are used in place of the eigenvalues will no longer be valid estimates - they will be 
in breakdown. Furthermore, the reweighting step will often needlessly exclude many observation vectors 
where there is only one contaminated cell. 
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3.2. Nearest positive definite matrix procedure 


Higham (20021 considers the problem of computing the nearest positive definite (NPD) matrix to a given 


symmetric matrix. The motivation stems from finance, where sample covariance mafrices are consfrucfed 
from vecfors of stock refums, however, fhe problem arises when nol all slocks are observed every day. In 
Ihis selling, classical covariances may be computed on a pairwise basis using dala drawn only from days 
where bolh stocks have dala available. The resulling covariance malrix is nol guaranteed to be PD because 


il has been buill from inconsislenl dala sels. Molivaled by Ihe same problem, Lpland el al. (20131 propose 
bolh a pseudo-likelihood and a Bayesian approach to find PD eslimales of pairwise correlalion malrices. 
However, Iheir approach relies on experl knowledge to formulate priors for Ihe pairwise covariances. 

The NPD procedure is similar to Ihe OGK procedure in lhal il performs a speclral decomposilion and 
Ihen updates Ihe eigenvalues to ensure lhal Ihe resull is PD. However, il does nol rely on linear Iransfor- 
malions of Ihe original dalasel and hence is nol affected by Ihe “oullier propagalion” issue associated wilh 
cellwise conlaminalion. Formally, for an arbilrary symmelric px p malrix A, Ihe aim is to find Ihe dislance 


y(A) = min{||A - W||f : W is a symmelric PD malrix}. 


( 2 ) 


and Ihe resulling malrix lhal achieves Ibis minimum dislance. |Higham] (|2002]) uses Ihe Frobenius norm, 
IIBIb = VWB), as il is “Ihe easiesl norm to work wilh for Ibis problem and also being Ihe nalural choice 
from Ihe slalistical poinl of view”. 

While Higham ( |2002 1 considers a variely of weighling mechanisms, in Ihe simplesl case wilhoul spec¬ 
ifying any weighls, Ihe procedure is quite slraighlforward. The final eslimale is W = EAE', where EA E/ 
is Ihe speclral decomposilion of A, wilh A = diag(/li,..., Ap) and A = diag(max{/l,-, d)), where d is a small 
posilive conslanl. In conlrasl to Ihe OGK procedure, if Ihe initial symmelric malrix is already PD, Ihen Ihe 
NPD melhod simply relums Ihe original pairwise covariance malrix. 

In Ihe presence of cellwise contamination the NPD method outperforms the OGK method. However, the 
NPD method often results in estimated matrices with a number of extremely small eigenvalues which give 
poorly conditioned estimates, i.e. the condition number of these estimators is very high as is the entropy 
loss, which involves the log of the eigenvalues. In general, it is not recommended to use either the OGK nor 
the NPD in isolation when there is cellwise contamination present. Even in the presence of standard row¬ 
wise contamination, the NPD method is not recommended due to its propensity to return poorly conditioned 
estimates. 


4. Precision matrix estimation 

Many statistical procedures are primarily concerned with the precision matrix, the inverse of a co- 
variance matrix, rather than the covariance matrix itself. For example, finding Mahalanobis dislances and 
performing linear discriminanl analysis bolh require an estimate of 0 = 11“^. Finding good precision ma¬ 
lrix estimates has been a focus of many investigators over a long period of time, Ihe firsl major conlribulion 


being |Dempsteq ( | 1 972] l . 

The following routines lake as an inpul an estimated covariance malrix and oulpul a regularised pre¬ 
cision malrix. In Section we demonslrale Ihe advanlages of using a robusl pairwise covariance malrix 
estimate as Ihe inpul to Ihese regularisalion routines. 

4.1. GIASSO 

A nalural way to estimate 0 is by maximising Ihe log-likelihood of Ihe dala. Wilh Gaussian observa¬ 
tions, Ihe log-likelihood lakes Ihe form. 


log |0| - lr(S0), 
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where S is an estimate of the covariance matrix of the data. Maximising Q with respect to 0 leads to the 
MLE, In general, will not be sparse, in the sense that it will contain no elements exactly equal to 
zero. Furthermore m p > n situations S will be singular so the MLE cannot be computed. [Yuan and Lin 
(|2007|) consider minimising the penalised negative log-likelihood. 


tr(S0) - log |0| A Zij \0ij\, (4) 

over the set of PD matrices where 4 is a tuning parameter to control the amount of shrinkage. [Friedman et al. 


(20081 refer to this estimator as the graphical lasso (GLASSO) and note that it has two major advantages 


over the solution is PD for all 4 > 0 even if S is singular, and for large values of 4 the resulting estimate, 
0, will be sparse. 

4.2. QUIC 

The QUIC method solves the same minimisation problem as the GLASSO. The improvement in speed 
comes from noticing that the Gaussian log-likelihood component of (|^ is twice differentiable and strictly 


convex which lends itself to a quadratic approximation and hence faster convergence ( jHsieh et al.j [201 Ij ). 
On the other hand, the penalty term is convex but not differentiable and so is treated separately. 

The QUIC routine, as implemented in the R package QUIC, explicitly includes a step that ensures pos¬ 
itive definiteness of the precision matrix for each iteration. Work has recently been undertaken to scale the 
QUIC estimator to scenarios with a million variables, see Hsieh et ^(|2013|l for details. 


4.3. CLIME 

An alternative to maximising the penalised log-likelihood is to use the constrained Ii minimisation 


approach to sparse precision matrix estimation (CLIME), implemented in the R package clime (Cai et al. 


2011[ 20121. The CLIME routine uses linear programming to solve the following (convex) optimisation 
problem, 

0* = min |0|i subject to: |S0 - lU < 4, 

where S is the sample covariance matrix and |A|i = ^jj |a,y| is the elementwise norm of a matrix. A, 
and |A|oo - max,-y |a,y| is the elementwise infinity norm. No symmetry requirements are placed on 0* so a 
symmetrising step is applied to obtain the final solution, 0, 

Bij - h = eLrnn < \9*\} + eti{\en > 10 * 1 ). 


Cai et al. 


(2011 1 shows that the resulting 0 is PD with high probability. 


Theorem 1 of 

Our simulations show that there is little difference between using CLIME and QUIC - the key point is 
that both appear to perform well in the presence of cellwise contamination when the input matrix is based 
on pairwise robust covariance estimates and it has been made PD using the NPD routine. 


5. Simulation study for p < n 

This section presents the results of an extensive simulation study to assess how well various robust 
covariance estimation techniques perform when used as an input to the regularisation routines outlined 
previously. 

The proposed estimator begins by finding the covariances between all p{p - l)/2 pairs of variables. 
For the scale estimator underlying the robust covariance estimator, we consider 2„, the r-scale, the MAD 
and the IQR. We also consider the P„ estimator and two adaptively trimmed variants P„, with trimming 
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Figure 2: Heat maps of the three kinds of precision matrices used to generate the data when p = 30. 


parameters d - 3 and d = 5, see Tarr et al. ( 2012| l for further details about these estimators. The pairwise 
covariances are arranged in a symmetric, though not necessarily PD, matrix. The symmetric matrix is 
transformed to a PD matrix using either the OGK method or the NPD method before being input into the 
GLASSO, QUIC or CLIME regularisation routines. For comparison purposes we also include the classical 
covariance estimator and the MCD as initial covariance matrix estimates. 


5.1. Design 

The simulated data follows a multivariate Gaussian distribution with n - 100 observations, NiO, 0“^). 
The precision matrices we select as the basis for the data generating process represent a broad range of 
scenarios that occur in practice and are similar to those used in Cai et aL| ( |2011| l and |Hsieh et al. (20111. In 
particular, we consider three types of precision matrices, 0, as shown in Figureand outlined below. 


1. Banded precision matrices, with elements = 0.6^' such that the values of the entries decay the 
further they are from the main diagonal. 

2. Sparse precision matrices, with randomly allocated non-zero entries, where 0 = B -f dl with each off 
diagonal entry in B generated independently, where P{bij - 0.5) = 0.1 and Pibij = 0) = 0.9 and 6 
is chosen such that the condition number of the matrix equals p. The matrix is then standardised to 
have diagonal components equal to one. This scenario will be referred to as scattered sparsity. 

3. Dense precision matrices, where 0 has all off diagonal elemenfs equal fo 0.5 and diagonal elemenfs 
equal fo 1. 


The oufliers were generated independenfly for each variable. In our simulations we allow fhe number 
of confaminafed observafions wifhin each variable fo increase up fo a maximum of 25 observafions (ouf 
of n = 100). In fhis way we have complete confrol over fhe fofal number of confaminafed cells. The 
disfribufion of fhe oufliers is a tio disfribufion scaled by eifher a factor of 10 for exfreme oufliers or VTO 
for moderafe oufliers. The moderafe oufliers are perhaps closer fo whaf one mighf expecf in a real dafa 
sef. However, the focus here is primarily on the extreme outliers where the overwhelming majority of the 
unusual observations lie well outside the cloud of standard observations. The extreme nature of the outliers 
serves to demark clearly estimators that have effectively broken down from those that are still capable of 
giving ballpark correct results. In both cases, the outliers are symmetrically distributed. 

Each of the regularisation routines require a tuning parameter. At each replication, the tuning parameter 
was obtained by training on a separate (uncontaminated) randomly generated data set drawn from the true 
data generating process. For the training data, a sequence of precision matrices was obtained and the value 


8 


































of the tuning parameter corresponding to the smallest entropy loss was then used for that replication. In 
practice, there was a small amount of variability in the choice of tuning parameter within each scenario 
and dimensionality. Furthermore, the QUIC and GLASSO routines almost always picked the same tuning 
parameter and the CLIME routine was free to choose slightly smaller tuning parameters. For example, in 
the scattered sparsity scenario with n = 100 and p = 60 the training set for CLIME resulted in a tuning 
parameter of around 0.09 whereas for QUIC and GLASSO it was closer to 0.12. In practice the tuning 


parameter may be chosen through cross validation, a BIC-type criterion such as in Yuan and Lin (20071, or 
it can be adjusted in an ad-hoc way until a desired level of sparsity is achieved. 

We performed an extensive investigation into the most appropriate way to compare estimated precision 
matrices in the presence of cellwise contamination. We considered a number of matrix norms, the Frobenius 
norm, one norm, infinity norm and spectral norm, as well as the log determinant, the condition number and 
the entropy loss. The definition and behaviour of these performance indices under cellwise contamination 
are outlined in the supplementary material. We found that the most appropriate measure was the entropy 
loss defined as, Le{0, 0) = fr(0“^0) - logdef(0“^0) - p. 

As in Lin and Perlman ( 1985[ l, we reporf fhe resulfs in ferms of fhe percenfage relative improvemenf in 
average loss (PRIAL), 


PRIAL(0) = 


Lj;(0,0o)-Lg(0,0) 

L£(0,0o) 


X 100, 


over A = 100 replication of each design, where 0o is fhe esfimafed precision mafrix after a regularisa- 
fion fechnique has been applied fo fhe classical sample covariance mafrix for unconfaminafed dafa. If is 
imporfanf fo note fhaf fhis is an exfremely harsh benchmark fo sef. 

5.2. Results 

5.2.1. No contamination 

Any good robusf mefhod should give comparable resulfs fo fhe classical non-robusf mefhod if is re¬ 
placing when presented wifh a clean dafasef. Table [TJpresenls fhe PRIAL resulfs for fhe no contamination 
scenario. As the PRIAL results are relative to the base case for each routine. Table [T] cannot be used to 
compare the performance of the CLIME routine to the QUIC routine. 

In the uncontaminated case, the OGK method substantially outperforms the NPD method. Overall, 
the methods appear to improve as the dimensionality increases, however, this is more a reflection of the 
deteriorating absolute performance of the baseline classical covariance matrix estimate. 

For p = 30, the pairwise methods outperform the MCD, however the MCD method uses \n + p + \\I2 
observations so when p = 90 the resulting estimator is the classical covariance estimate applied to 95 out 
of a total n = 100 observations. Hence, it is not surprising that the PRIAL for the MCD method is so close 


to zero. In fact, the MCD is not recommended for use when n <2p (Rousseeuw et al. 20131. 


The reweighted OGK (OGKw) methods essentially perform outlier detection and deletion before re¬ 
turning a classical covariance estimate of the cleaned data set. The performance of these methods is broadly 
similar over all the various initial scale estimates. Though not shown in Table [T| the MAD performs partic¬ 
ularly poorly under both the OGK and the NPD corrections and would not be recommended for use. 

As would be expected, given the solid Gaussian performance of P„ (see Tarr et al. ( 2012| |), the methods 
based on outperform those based on the r-scale and Qn. The relative deterioration in performance for 
the robust methods compared to the classical method is comparable to that in the simple univariate scale 
case. For example, the univariate scale estimator P„ has an asymptotic Gaussian relative efficiency of 86%. 
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5.2.2. Cellwise contamination 

There are a number of ways to compare and contrast the various estimators. We consider data with 
n - 100 observations from three different data generating processes, across four dimensions, p - 15, 30, 
60 or 90, contaminated with either moderate or extreme outliers, as explained in Section HD We present 
figures for fhe exfreme case only buf commenf on bofh based on fhe resulfs of = 100 replicafions of each 
design. We implemenf an array of inifial covariance esfimafion fechniques and process fhese fhrough fhe 
GLASSO, QUIC and CLIME regularisafion routines. Finally, as ouflined in fhe supplemenfary material 
fo fhis arficle, fhere are a number of performance indices fhaf are considered. This secfion exfracfs and 
synfhesises fhe key resulfs. 

We firsf consider fhe effecf of dimensionalify on fhe performance of fhe various esfimafors. A fypical 
example is shown in Figure where we plof fhe PRIAF resulfs for fhe precision mafrix resulfing from fhe 
CFIME procedure for various inpuf covariance mafrices across differenf amounfs of exfreme confamina- 
fion in each variable. The original dafa was generated assuming a banded precision mafrix, however fhe 
frend holds frue for scaffered sparsity and dense precision mafrices as well as for fhe QUIC and GFASSO 
procedures. 

For relafively low dimensions, such as in fhe boffom panel of Figure where p - 15, fhere is clearly 
an advanfage fo using fhe NPD mefhod over fhe OGK mefhod once fhere is more fhan a few percenf of 
observations in each variable being confaminafed. To avoid clutter, only fhe OGK mefhod wifh P„ has been 
included in fhe plofs, however, if is represenfafive of fhe performance of fhe ofher scale estimators when 
used in conjunction wifh fhe OGK mefhod. 

As fhe dimensionalify increases, fhe OGK and fhe MCD mefhods deferiorafe fasfer. When p - 90, as 
ouflined in fhe previous secfion, fhe MCD mefhod behaves like fhe classical mefhod. The OGK mefhod 
performs similarly poorly as ouflier propagafion can lead fo more fhan half of fhe elemenfs in each principle 
componenf vecfor being confaminafed. Hence, fhe eigenvalues in fhe specfral decomposifion are replaced 
wifh robusf esfimafes of scale fhaf may no longer be valid. 

Remarkably, fhe NPD mefhods perform consisfenfly well. Their performance, relative fo fhe classical 
mefhod wifh no confaminafion improves as fhe number of variables increases. The P„ based mefhod per¬ 
forms well for low levels of confaminafion, however once fhe proporfion of confaminafed cells is greafer 
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Table 1: PRIAL results for the various estimators when there is no contamination present. 
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Figure 3: PRIAL results for a selection of estimators applied to data generated with a banded precision matrix with extreme outliers 
for p = 90 (top), p = 30 (middle) and p = 15 (bottom) using the CLIME regularisation procedure. 
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than 10% it does not perform as well as the other pairwise methods due to its lower breakdown value. 

It is interesting to note that the adaptively trimmed P„ with adaptive trimming parameter d - 3, P„, 
follows a somewhat different trajectory to the rest of the NPD type estimators. It maintains a relatively 
high level of performance even for quite high levels of contamination. This is due to the extreme nature 
of the contamination making the adaptive trimming extremely effective in identifying and excluding the 
errant observations. The advantage of P„ is lost when the contaminating distribution has only moderately 
sized outliers, in which case all the NPD pairwise methods perform comparably because P„ does almost no 
trimming. 

To summarise, for p = 30, p = 60 and p = 90, using a pairwise method in conjunction with the 
NPD procedure as an input into the CLIME regularisation routine, the increase in entropy loss can be 
contained to less than double that of the classical method without contamination if the proportion of cellwise 
contamination is less than 10%. 

The same pattern holds true when using the QUIC or the GLASSO regularisation routines. To demon¬ 
strate this consider Figure|^where the PRIAL results are shown for CLIME, QUIC and the GLASSO under 
the banded precision matrix scenario with extreme outliers and p = 60. As would be expected the QUIC and 
GLASSO results are essentially identical, and largely consistent with the CLIME results in the top panel. 

Consulting the raw entropy loss numbers reveals that the CLIME method gives slightly lower average 
entropy loss measurements, particularly for very high levels of contamination. In practice it does not matter 
what regularisation routine is used, the benefits of taking a pairwise approach to covariance estimation in 
the presence of cellwise contamination will still hold. 

The NPD pairwise approach is a major improvement over standard robust estimators. An example of 
this is given in Figure where we present the average PRIAL results for the QUIC estimator with p = 30 
for the scenarios illustrated in Figure Across all scenarios the same general pattern holds, the classical 
method and the OGK and MCD methods fail quite rapidly whereas the NPD approach offers much greafer 
resilience fo the cellwise contamination. 

For the banded precision matrix scenario, top panel of Figure the NPD based methods under the 
various robust scale estimators give similar results with P„ having a slight advantage over the others for low 
levels of contamination whereas has an advantage for higher contamination proportions. 

For the scattered precision matrix and the dense precision matrix scenarios, gives the best results. 
The advantage of the adaptive trimming procedure is lost when the outliers are not so extreme, however, in 
such scenarios the adaptive trimming approach performs no worse than the other NPD methods. 

We previously established that matrix norms are not a good performance measure for precision matrices. 
In terms of the other performance indicators, for all scenarios considered the log condition number remained 
bounded, suggesting that all three regularisation routines return well conditioned precision matrix estimates 
regardless of the level of contamination or the data generating process. 

The NPD also performed well in terms of the log determinant performance index. As with the entropy 
loss, there appears to be an advantage to using P„ over the other scale estimators in each of the scenarios. 
Unlike with the entropy loss, the advantage of the adaptive trimming procedure is still evident even when 
the contamination is less extreme. 

It is also constructive to see how 2 = 0“', the inverse of the estimated regularised precision matrix, 
compares with the true covariance matrix 2. Figure [^presents the average entropy loss and Frobenius norm 
results for the resulting estimated covariance matrices after regularisation using the CLIME procedure. We 
see similar trends to those outlined earlier. While using P„ alone does not perform well when the amount 
of contamination in each variable is large, the adaptive trimming procedure gives excellent results. The 
other pairwise methods also perform quite well. However, as we would expect, the classical method and 
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Figure 4: PRIAL results for a selection of estimators applied to data generated with a banded precision matrix with extreme outliers 
for p = 60 using CLIME (top), QUIC (middle) and GLASSO (bottom). 
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Figure 5: PRIAL results for a selection of estimators applied to data generated with a banded precision matrix (top), scattered 
precision matrix (middle) and dense precision matrix (bottom) with extreme outliers for p = 30 using the QUIC routine. 
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standard robust techniques, MCD and OGK fail quite rapidly. In general, the patterns for the matrix norms 
applied to £ are very similar to those of the entropy loss for 0. That is, the robust covariance matrices that 
are obtained at the end of the proposed procedure perform similarly well to the robust regularised precision 
matrices. 

There are also important differences between £ and the initial pairwise covariance matrix obtained after 
applying the NPD procedure. While the matrix norms for the the initial pairwise matrices were comparable 
to those for £, the entropy loss and log determinant results for the initial pairwise covariance matrices were 
much worse due to small eigenvalues resulting from the NPD method. As such, we would not recommend 
simply applying the NPD procedure to a pairwise covariance matrix without further regularisation. 


5.3. Gaussian graphical discovery rates 

Another way to analyse the performance of a precision matrix estimator is through the lens of a Gaus¬ 
sian graphical model. When the data follow a multivariate Gaussian distribution, pairwise conditional 
independence between variables Xj and Xj. holds if and only if Ojk - 0, therefore inferring linkages between 
variables corresponds to identifying the nonzero elements of 0 = {6jk), see Lauritzen (19961 for further 
details. Hence, rather than focussing on overall measures of similarity between the estimated precision and 
the true precision matrix, it can be informative to see how often the estimated precision matrix identifies fhe 
correcf non-zero elemenfs from fhe frue precision mafrix. 

Figure]^ shows visually how well fhe QUIC esfimafor performs in fhe presence of cellwise confamina- 
fion. When fhere is no confaminafion, all mefhods appear fo perform similarly well in ferms of fheir ability 
fo correcfly idenlify fhe frue non-zero elemenfs in fhe precision mafrix. However, in fhe presence of 10% 
exfreme confaminafion, fhe classical covariance and fhe MCD approach bofh fail fo idenlify any struclure 
as Ihey lend fo refum overly dense precision malrices. On fhe olher hand, fhe pairwise robusl mefhods are, 
on average, slill able fo idenlify fhe underlying sfruclure. 

In fhe machine learning lileralure, fhe Mallhews correlalion coefficienl (MCC), also known as fhe (p 
coefficienf in fhe slafislics lileralure, is offen used fo assess fhe ability of an esfimafor fo idenlify fhe frue 
non-zero elemenfs in a precision mafrix (Mallhews 19751. If lakes info accounl fhe number of frue posifives 
(TP), false positives (FP), frue negatives (TN) and false negafives (FN), 


TP X TN - FP X FN 

MCC = - . 

V(TP -t FP)(TP -t FN)(TN -t FP)(TN -F FN) 

Typical MCC resulls are given in Figure for fhe QUIC procedure. When cellwise contamination 
is introduced, the MCD, OGK and classical covariance approaches lose their ability to identify the true 
structure in the precision matrix quite quickly. The pairwise methods are much more resilient. As the 
number of contaminated observations in each variable increases, the ability of the pairwise methods to 
identify the true structure decreases gradually. When there is no contamination and p - 60, the classical 
method has an MCC of 0.39 compared with an MCC of 0.35 for the pairwise method based on P„. When 
there is 5% extreme contamination in each variable, the MCC for the P„ based method is still at 0.29, while 
the classical approach is at 0.05. This pattern of results is virtually unchanged for the pairwise methods 
when the contamination is less extreme. 


6. Simulation study for p > n 

Of particular interest in a data mining context is the case when the number of variables p, is larger than 
the number of observations n. In order to perform simulations in a reasonable amount of time we refor¬ 
mulated the simulation settings in Section such that samples of size n = 50 were drawn with dimension 
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Figure 6: Average entropy loss results for the precision matrices resulting from the CLIME procedure (top) and average entropy 
loss and Frobenius norm results for the resulting covariance matrix estimates (middle and bottom) for p = 60 with scattered sparsity 
and extreme outliers. 
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Figure 7: Heat maps showing how often each element in the precision matrix is identified as being non-zero using the QUIC routine 
over 100 replications. The top half have no contamination and the bottom half have 10% extreme contamination. 
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Figure 8: Matthews correlation coefficient results for the QUIC procedure with extreme outliers, p = 30 (top), p = 60 (middle) 
and p = 90 (bottom) in the scattered sparsity precision matrix scenario. 
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p = 15, 30, 60 and 90. The same types of precision matrices were considered as in the previous section, 
though we also looked at the case where the condition number of the sparse precision matrix was much 
larger than the dimension. Given the similar performance of the various regularisation routines found in the 
previous section, we restricted attention to the QUIC routine. 

The results are similar to those found in the p < n setting. Figure [^presents the MCC results for the 
case when p - 60 and n = 50. We can draw a direct comparison between Figure]^ and the middle panel of 
Figure where p - 60 but n = 100. An important difference is that when p > n, the MCC values are lower 
across all levels of contamination, indicating that it is more difficult to recover the support of a Gaussian 
graphical model. For example, the classical approach had a MCC of 0.39 when n - 100, but only 0.29 when 
n - 50. Figure [fallow us to compare the relative performance of the pairwise techniques as the condition 
number of the true precision matrix increases from 60 to 1000. The baseline value for the entropy loss is 
8.8 for the classical approach when the condition number is 60, which increases slightly to 10.0 when the 
condition number is 1000. We observe that the performance of the various pairwise estimators decreases 
slightly as the condition number increases. For example when there is 8% contamination in each variable, 
the adaptively trimmed estimator, has a PRIAL of -46% when the condition number is 60 (top panel), 
which decreases to -72% when the condition number is 1000 (bottom panel). 
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Figure 9: Matthews correlation coefficients results for the QUIC procedure for a scattered precision matrix with p = 60, n = 50, 
and extreme outliers. 

Figure [^demonstrates the impact of changing the extremity of the outliers when p = 90 and n = 50. 
As described in Section]^ the outliers are generated from a multivariate t\Q distribution with scale matrix 
k\p, for k - 10, 50 and 100. With moderate outliers, as shown in the top panel of Figure [TT| all the robust 
procedures perform comparably and the classical approach is not affected too badly. As the extremity of the 
outliers increases, shown in the bottom two panels of Figure the performance of the classical approach 
deteriorates quickly. Between the middle and bottom panels, there is little difference in the performance of 
the high-breakdown value robust estimators, indicating that they have stabilised and are likely to continue 
giving the same result even if the existing outliers were moved further away. It is clear that the method based 
on Pn, with its lower breakdown value, continues to be affected by the size of contamination when there is 
a large proportion of contamination in each variable and we would expect its performance to continue to 
deteriorate if the outlier generating distribution was even more extreme. 

Note that P„, the adaptively trimmed P„, remains the most stable as the extremity of the outliers in- 
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Figure 10: Entropy loss PRIAL results for the QUIC procedure with a scattered precision matrix, p = 60, n - 50 and extreme 
outliers. The condition number of the underlying precision matrix is 60 in the top panel and 1000 in the bottom panel. 
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creases. In fact, when there is 10% cellwise contamination the PRIAL remains at -59% whether k - 50 
or 100, which is down from -47% when k = 10. This can be attributed to the adaptive trimming correctly 
identifying the vast majority of the contaminated bivariate observations when the outliers are so extreme. 

7. Conclusion 

A pairwise approach to covariance estimation has a natural resilience to the type of cellwise contamina¬ 
tion seen in high dimensional scenarios where classical robust procedures, such as the MCD, M-estimators, 
Quadrant Covariance and OGK, tend to fail. 

We considered a broad range of scenarios: from dense precision matrices, as typically found in standard 
analyses with n » p; to banded precision matrices that often occur in time series settings and may also 
be representative of scenarios with block diagonal precision matrices; as well as scattered sparsity, where 
the linkages between variables are not known beforehand and can show up anywhere within the precision 
matrix, as is often found in settings where p > n. 

After careful consideration of the various performance indices available in the multivariate setting, out¬ 
lined in the supplementary material, our primary choice of performance measure was the entropy loss. 
When appropriate, we showed that the entropy loss returned similar conclusions to other performance in¬ 
dices, such as the Frobenius norm and log determinant. 

We have shown that combining robust pairwise covariance estimation with the NPD method and reg- 
ularisation techniques such as the CLIME, QUIC or GLASSO yield precision matrices that are robust to 
cellwise contamination. The additional advantages of the regularisation techniques, such as the promotion 
of sparsity also carry through. While it was expected, given that they are solving the same minimisation 
problem, it is reassuring to find that the QUIC estimates are virtually indistinguishable from the standard 
GLASSO approach in all scenarios considered here. Furthermore, it did not appear to matter which of the 
three considered regularisation routines was applied, as all gave broadly similar results in the various sce¬ 
narios considered. This is comforting given the current pace of research in this area, with new procedures 
being suggested frequently. 

We demonstrated that the proposed approach maintains its ability to identify the true precision matrix 
structure, as measured by the Matthews correlation coefficient, under moderate levels of contamination. 

We also investigated what happens when the resulting precision matrix is inverted to find fhe corre¬ 
sponding covariance mafrix esfimafe. Applying fhe same performance indices fo fhe resulfing covariance 
mafrices, we found fhey perform similarly well fo fhe underlying precision mafrix. 

The simulafion sfudy allowed for quife high levels of arbifrary confaminafion in mulfivariafe dafa sefs. 
As such, fhe pairwise fechniques based on fhe sfandard esfimafor unsurprisingly did nol perform as well 
as Q„ and r-scale esfimafors, however, fhe adaptively frimmed P„, Pn wifh frimming parameter d = 3 
fypically performed exfremely well, due fo ifs abilify fo deled and Irim exlreme oulliers in bivariale space. 
Finally, we showed lhal fhe performance of fhe proposed lechnique continues fo perform well even when p 
is moderalely larger lhan n. 
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Figure 11: Results for the QUIC procedure for a banded precision matrix with p = 90 and n = 50. The outlier generating 
distribution is a multivariate fio distribution with scale matrix IOI90 (top), SOIgo (middle) and 100 190 (bottom). 
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Abstract 


This supplementary material gives a summary of the performance indices used in the article “Robust esti¬ 
mation of covariance and precision matrices under cellwise contamination” by|Tarr et al. (2014i. 


1. Performance indices 

We require a way to assess the performance of various covariance and precision matrix estimators under 
cellwise contamination. There are a range of possible ways to measure how close an estimated matrix is to 
the true value. In order to assess the performance of our proposed estimators, we first need to identify which 
performance indices are appropriate. One class of performance indices considered are matrix norms which 
measure the size of a matrix. The second class looks at how closely the estimated precision (or covariance) 
matrix reflects the nature of the theoretical precision (or covariance) matrix, through either the determinant, 
condition number or an overall entropy loss index. Let H denote the true covariance matrix and 0 = 2“' 
denote the true precision matrix. In this section we define the performance measures and consider their 
appropriateness in the context of cellwise contamination. 

1.1. Measures of performance 

1.1.1. Matrix norms 

The Frobenius norm is perhaps the most common matrix norm, it is an element-wise norm, the Eu¬ 
clidean norm of A treated as if it were a vector of length p^, ||A||f = Vtr(A'A). An alternative way 
of constructing a matrix norm is to take a vector norm and use it to generate a matrix norm of the form 
||A|| = supii^ii^j ||Ax||, where || • || on the left is the induced (or operator) norm and || • || on the right is a vector 
norm. Examples of induced norms are the Lp norms. Eor example, the one norm, ||A||i = maxy Y!i=\ \oij\', 
the infinity norm, ||A||oo - max, y]”^j and the spectral norm, ||A||2 = o‘max(A), where cri„ax(A) is the 
largest singular value of A. When A is nonsingular ||A“'||2 = l/o~i-nin(A) where crniin(A) is the smallest 
singular value of A. 

In our experiments, we apply the matrix norms to A = 0o - I where 0o = 0“'0. While 0 and 0 are 
symmetric, it is not the case that the product of two symmetric matrices yields a symmetric matrix, hence 
in general 0o 0 q, so in practice the one norm and the infinity norm may yield different results. 
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For each norm, one may naively assume that the closer to zero the better, however, in Section 1.2 we 


demonstrate that this is not always the case, particularly when estimating precision matrices in the presence 
of outliers. 

Aside from matrix norms, there are a few other commonly employed performance indices. 

1.1.2. Entropy loss 

The entropy loss, as suggested by|Stein|(|1956|) and featured in James and Stein (19611 and alsopey and 


Srinivasan (19851, when applied to precision matrices is defined as. 


L£(0,0) = tr(0-'0) -logdet(0“'0) -p 


= - log - F’ 

r=l 


where T,, / = l,...,p, are the eigenvalues of 0o. Stein (19561 notes that this function is “somewhat 


arbitrary” but it is convex in 0 and assuming that 0 and 0 are positive semidefinite, Le{&, 0) > 0 with 
equality if and only if 0 = 0. 

However, the entropy loss is not as “arbitrary” as it may seem at first. Note that the Kullback-Leibler 
divergence from El) to Af2(/t,22) is, DklCM, Af 2 ) - ^T£:(Ei,E 2 )- The close link between the entropy 
loss and the Kullback-Leibler or Bregman divergence loss is shown in a Bayesian context by [Gupta and 
Srivastava| ( |2010| ). 

There is also a clear link between the entropy loss and the likelihood ratio test for //q : E = E* with 
unknown mean assuming the data come from a Gaussian distribution, - hLeCE,*, S), where Zq and 

Zi are the log-likelihoods under the null and alternative hypotheses, see, for example [Mardia et al.|(|1979| p. 
126). 

The entropy loss is used extensively as a basis for developing and assessing improved precision and 


covariance matrix estimators, for example in Lin and Perlman (19851, Yang and Berger (1994) and more 


recently. Won et al. (20131. 


1.1.3. Log determinant 

In the multivariate Gaussian setting, |Wilks| (1932i names the determinant of the covariance matrix, 
det(E), the generalised variance. The generalised precision is similarly defined as the determinant of the 
precision matrix, det(0). This idea can be used as the basis for a performance index. Consider the log of 
the determinant of the standardised covariance or precision matrix, L£)(0o) = logdet(0o) = logd;. 
The determinant of an identity matrix is 1, so the optimal value of L/)(0o) is 0. Positive (negative) log 
determinant results indicate that the generalised variance or precision is being over (under) estimated. Note 
that, L/)(0o) - -Ld(^o)- Thus, methods that underestimate the generalised variance will overestimate the 
generalised precision. 

The log determinant is a very crude performance index which can be dominated by one eigenvalue that 
is very close to zero. Furthermore, it is incorporated as part of the entropy loss so there is little need to focus 
on it in the results of the simulation studies. 


1.1.4. Log condition number 

Formally, the condition number of a square matrix is the product of the norm of the matrix and the norm 
of its inverse, a:(0o) = l|0o'll ' hence depends on the kind of matrix-norm. It is common to use 

the spectral norm, in which case the condition number is the ratio of the largest to the smallest non-zero 
singular value of the matrix. The condition number associated with the systems of equations. Ax = b, gives 
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Figure 1: A series of boxplots showing the distribution of the largest through to the smallest eigenvalues of an estimated covariance 
matrix, S, over N - 100 samples from N(0, 1) with n - 100 and p = 30, 60 and 90. 


a bound on how inaccurate the solution may be. A system is said to be ill-conditioned if small changes in 
the inputs, A and b, result in large changes in the solution, x. 

Consider the performance index defined as the log of the condition number of 0o^ 

L^(0O) ^ log x(0o) ^ log(crniax(0o)) “ log(crn,in(0o)). 

The condition number for an identity matrix is 1 and the condition number for a singular matrix is 
infinity, so the 0 < L^{&o) < oo. 


Gentle (2007 1 notes that while the condition number of a matrix provides a useful indication of its 


ability to solve linear equations accurately, it can be misleading at times when the rows (or columns) of the 
matrix have very different scales. That is, the condition number can be changed by simply scaling the rows 
or columns which does not actually make a linear system of equations any better or worse conditioned. This 
is known as artificial ill-conditioning. 


In the context of the sample covariance matrix, S, Ledoit and Wolf (20041 note that “when the ratio p/n 


is less than one but not negligible, the sample covariance matrix is invertible but numerically ill-conditioned. 


which means that inverting it amplifies estimation error dramatically.” Won et al. (20131 go further stating 
that “the eigenstructure [of S] tends to be systematically distorted unless p/n is extremely small, resulting 
in numerically ill-conditioned estimators for E.” Figure [T] demonstrates the systematic deterioration in the 
eigenstructure as p/n ^ 1. The eigenvalues of the true covariance matrix are all identically 1, however this 
is not reflected in the eigenvalues of the estimated sample covariance matrices. 

As with the log determinant, the log condition number is not a particularly discerning performance 
index. To assess whether a robust estimator provides reasonable estimates, the most that it can contribute is 
whether or not the log condition number remains bounded. 

1.1.5. Quadratic loss 

Another index that is frequently used in the literature to assess the performance of covariance matrix 


estimators is the quadratic loss. The exact specification varies from paper to paper, for example Ledoit and 
Wolf (2004 1 define it as, ||E - E||g. An alternative specification of the quadratic loss, more in line with the 


Won et al. 


entropy loss, is used in 
linked to the Frobenius norm. 


(|2013 1, ||EE ^ - I||^. It is obvious that the quadratic loss is intrinsically 
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1.2. Behaviour of performance indices 

The performance indices outlined in this section are typically used to compare competing estimators 
in uncontaminated data sets. Contaminated data will have potentially severe implications for structure and 
size of the estimated precision and covariance matrices, and it is not clear how these indices will behave in 
such settings. As such, we begin our investigation by exploring how these indices react to the presence of 
gross outliers in a data set. 

The model used to assess the behaviour of the various performance indices is typical of that used in the 
simulation study, n = 100 observations drawn from the standard multivariate Gaussian distribution, Af(0,1), 
with p = 30. 

1.2.1. Inflated variances 

This section explores how the various performance indices react if we artificially inflate the variance of 
the first variable, i.e. increase the value of in the sample covariance matrix, S = {sij), based on a single 
sample of uncontaminated data. In this simple case, where H = 0 = I, the matrix norms are applied to S -1 
or - I and the entropy loss, log determinant and log condition number are simply applied to S or S“^. 
Note that in this setting the one norm and the infinity norm will give identical results and so only the results 
for the one norm are shown. 

Figure l^shows the behaviour of the various indices when applied to these adjusted covariance matrices 
and Figurej^presents the same for the resulting precision matrices, 0 = The horizontal axis shows the 
size of i'll, the artificially inflafed variance of fhe firsf variable in fhe sample covariance mafrix. 

In Figure]^ fhe majorify of fhe performance indices behave similarly in fhe covariance case - fhere is 
an overall posifive frend as fhe variance of fhe firsf variable increases. The specfral norm and Frobenius 
norm bofh increase uniformly wifh The one norm, and correspondingly fhe infinify norm (nof shown), 
remains fiaf as fhe sum of fhe absolufe value of fhe elemenfs in anofher column (or row) remains larger fhan 
fhe firsf column (row) up unfil fhe poinf where w 2, af which poinf fhe one norm (and infinify norm) 
increase linearly wifh ^n- For large i'll, if is clear fhaf all considered performance indices regisfer thaf fhe 
adjusfed S mafrix is no longer close fo fhe frue value, I. 

In sfark confrasf, Figure shows fhe indices when applied fo fhe resulting precision mafrix (affer fhe 
firsf enfry in fhe covariance mafrix has been artificially inflafed). As expecfed, fhe condifion number of fhe 
resulting inverse is idenfical fo fhaf of fhe original covariance mafrix and Lo{S~^) = -Lo(S), however fhe 
ofher indices exhibif somewhaf differenf behaviour. In parficular fhe mafrix norms fend fo decrease, rafher 
fhan increase. This is explained by noting fhaf as fhe firsf elemenf in fhe covariance mafrix is arfificially 
inflafed, fhe firsf row and column of fhe precision mafrix decay fowards zero while fhe ofher elemenfs are 
more or less consfanf. This is demonsfrafed in Figure|^where Q\i and 621 bofh fend fowards zero whereas fhe 
ofher main diagonal and off diagonal elemenfs remain quife sfable as fhe level of confaminafion increases. 

In Figure fhe Frobenius norm, an elemenfwise norm, exhibifs a minimum fuming poinf before lev¬ 
elling off. This is due fo fhe firsf row and column of fhe precision mafrix converging rapidly fo zero, offen 
from relatively large sfarfing poinfs, whereas fhe convergence of fhe ofher elemenfs is nof as drasfic and nof 
necessarily shrinking fowards zero, hence fhe upward frend. 

The enfropy loss broadly exhibifs similar behaviour in bofh Figures and The minimum fuming 
poinf in Figure is somewhaf similar fo fhaf of fhe Frobenius norm and is explained by nofing fhaf fhis is 
due fo fhe sum of fhe eigenvalues decaying quife quickly before levelling off as increases, whereas fhe 
sum of fhe logs of fhe eigenvalues decays much more slowly. Regardless, if is clear fhaf fhe enfropy loss 
fends fo reflecf fhe impacf of fhe inflafed variance in bofh fhe covariance mafrix and fhe resulting precision 
mafrix. 
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Figure 2: The impact of artificially inflating the size of the top left element of the sample covariance matrix, Sn, on each of the 
performance indices when applied to the resulting covariance matrix. 








Figure 3: The impact of artificially inflating the size of the top left element of the sample covariance matrix, Sn, on each of the 
performance indices when applied to the resulting precision matrix. 
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Figure 4: The change in few elements of 0 = S 'as 5ii is artificially inflated. Main diagonal elements are in the top row and the 
off diagonal elements are in the bottom row. Note that S = (i,y) and 0 = (0,y). 

1.2.2. Contamination in the data 

Instead of directly manipulating the estimated covariance matrix, consider introducing contamination 
into the original data set and observing what effect that has on the performance metrics applied to the 
covariance and resulting precision matrix. For each level of contamination we take N = 1000 samples from 
For each sample we estimate the classical sample covariance matrix, S, and take the inverse to 
obtain © = 

Figures|^and|^show the behaviour of the various loss indices over N = 1000 replications. The horizon¬ 
tal axis represents the number of contaminated observations within each of the p = 30 variables. Starting 
with an uncontaminated multivariate Gaussian distribution, we then progressively add one contaminated 
observation to each variable until there are 24 contaminated observations within each variable. The con¬ 
tamination is performed by assigning to each randomly selected cell the value of 10. 

In general, cellwise outlying contamination will destroy any existing dependence structure and inflate 
the main diagonal of the covariance matrix, resulting in increases for the entropy loss and matrix norms 
applied to the covariance as seen in Figure]^ The log determinant of the estimated covariance matrix trends 
upwards, demonstrating the over estimated generalised variance with increasing levels of contamination. 
Apart from a relatively minor spike when there is only one contaminated observation in each variable, the 
condition number is not adversely affected by increasing levels of contamination, reflecting the stabilised 
eigenvalues of the resulting covariance matrix. This demonstrates that in this setting, the condition number 
is not an appropriate index against which to compare the performance of competing robust estimators. 

The interpretation of the performance indices when they are applied to the resulting precision matrix is 
more complicated. We see in Figure |^that there is still structure present in the precision matrix, in the sense 
that there is a main diagonal behaving distinctly from the off diagonal elements. However, all the elements 
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Figure 5: The impact of randomly contaminating a certain number of cells in each variable of a 100 x 30 data matrix on the various 
performance indices when applied to the resulting covariance matrix. 
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Figure 6: The impact of randomly contaminating a certain number of cells in each variable of a 100 x 30 data matrix on the various 
performance indices when applied to the resulting precision matrix. 


31 









tend to shrink towards zero. Hence, for large amounts of contamination, 0 -1 w -I and so the matrix norms 
tend to converge to || - I||. 

As in the previous scenario, the Frobenius norm exhibits a minimal turning point before plateauing. 
This is explained by noting that while the introduction of contamination has an immediate shrinkage effect 
on the main diagonal of the precision matrix, including one or two influential observations in each variable 
induces an artificially high level of correlation between some variables. Hence, it can take some time for 
the off diagonal elements to stabilise. When more than a few outlying cells are present in each variable, the 
artificial correlation structure wanes and hence the Frobenius norm applied to the resulting precision matrix 
trends towards ||I||f = ^Jp. Hence, in this contamination setting the matrix norms appear to be useful only 
when applied to the covariance matrix, not the precision matrix. 

As in the previous scenario, the entropy loss behaves consistently for both the covariance and precision 
matrix. Similarly to Figure it exhibits a slight drop when only one cell in each variable is contaminated 
after which it increases as the proportion of contaminated cells grows. As such, the entropy loss is the 
preferred performance index when looking across both covariance and precision matrix estimators. 


32 


